- 
                Notifications
    You must be signed in to change notification settings 
- Fork 13.4k
llama.cpp : split llama_context_params into model and context params #3301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Despite what the prints say, it looks like the KV cache is offloaded regardless of the value of  | 
| Both the  I'm not 100% sure how to proceed. The results happen to be correct, but as I said: supporting the combination does not make sense and I would not consider it to be in-scope. So I personally would still prefer to disallow it and to adjust the program logic to match the prints. | 
| The reason this is relevant to this PR is that currently I have duplicated  There is also the issue that the only documentation of  | 
| Alright. I do not feel strongly about this and my line of reasoning is revolving around what is easier to implement and maintain anyways (assuming we want to keep  | 
| I ran some tests with  | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can merge this together with #3228 to make the API change in one go
| Thanks for completing #2534! It looks and runs fine on my end. Note there are a few things from my PR that you may want to add here, such as README updates and including the thread count in the system info line. | 
| While we are changing the API, should we also remove all the  | 
| 
 Do you mean  | 
| /*.n_threads =*/ GGML_DEFAULT_N_THREADS, // TODO: better default | ||
| /*.n_threads_batch =*/ GGML_DEFAULT_N_THREADS, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we change this to something like std::thread::hardware_concurrency(), or get_num_physical_cores() from common.cpp? It shouldn't affect most of our examples, but a better default could help downstream users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The best option would be to move get_num_physical_cores() to ggml and use it.
But can be done in separate PR
| If the API changes look good I think this can be merged. | 
…example * 'master' of github.com:ggerganov/llama.cpp: ggml-cuda : perform cublas mat mul of quantized types as f16 (ggml-org#3412) llama.cpp : add documentation about rope_freq_base and scale values (ggml-org#3401) train : fix KQ_pos allocation (ggml-org#3392) llama : quantize up to 31% faster on Linux and Windows with mmap (ggml-org#3206) readme : update hot topics + model links (ggml-org#3399) readme : add link to grammars app (ggml-org#3388) swift : fix build on xcode 15 (ggml-org#3387) build : enable more non-default compiler warnings (ggml-org#3200) ggml_tensor: update the structure comments. (ggml-org#3283) ggml : release the requested thread pool resource (ggml-org#3292) llama.cpp : split llama_context_params into model and context params (ggml-org#3301) ci : multithreaded builds (ggml-org#3311) train : finetune LORA (ggml-org#2632) gguf : basic type checking in gguf_get_* (ggml-org#3346) gguf : make token scores and types optional (ggml-org#3347) ci : disable freeBSD builds due to lack of VMs (ggml-org#3381) llama : custom attention mask + parallel decoding + no context swaps (ggml-org#3228) docs : mark code as Bash (ggml-org#3375) readme : add Mistral AI release 0.1 (ggml-org#3362) ggml-cuda : perform cublas fp16 matrix multiplication as fp16 (ggml-org#3370)
…gml-org#3301) * llama.cpp : split llama_context_params into model and context params ggml-ci * fix metal build * fix freq_base/scale default to model value * llama-bench : keep the same model between tests when possible * move n_threads to llama_context_params, add n_threads_batch * fix mpi build * remove kv_size(), cuda scratch fixes * remove low-vram option * add n_threads_batch to system info, refactor to get_system_info() * add documentation about --threads-batch to the READMEs * llama-bench fix * main : fix rope freq/scale warning * llama.cpp : add llama_get_model common : add llama_tokenize from model * remove duplicated ctx/model functions ggml-ci * cuda : print total VRAM used
Currently,
llama_load_model_from_fileandllama_new_context_with_modelreceive the same parameters, but it is not specified which parameters are relevant for each object. This change splits the parameters intollama_model_paramsandllama_context_paramsso that it is clear what parameters can be modified per-context.Additionally:
n_ctx_trainas the default context size whenllama_context_params::n_ctx == 0or-c 0n_threadstollama_context_params, addsn_threads_batchand the corresponding command line argument-tb, --threads-batchfor batches/prompt processinglow_vramparameter and the corresponding-lv, --low-vramcommand line argument